Conversation

@eddyxu eddyxu commented Dec 24, 2025

Add lance format as one of the packaged_modules.

import datasets

ds = datasets.load_dataset("org/lance_repo", split="train")

# Or

ds = datasets.load_dataset("./local/data.lance")

Author

eddyxu commented Dec 24, 2025

Mentioned #7863 as well

@zhe-thoughts

@pdames for vis

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

lhoestq commented Dec 29, 2025

Cool! I noticed that the current implementation doesn't support streaming because of the symlink hack.

I believe you can do something like this instead:

def _generate_tables(self, paths: list[str]):
    for path in paths:
        ds = lance.dataset(path)
        for frag_idx, fragment in enumerate(ds.get_fragments()):
            for batch_idx, batch in enumerate(
                fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
            ):
                table = pa.Table.from_batches([batch])
                table = self._cast_table(table)
                yield Key(frag_idx, batch_idx), table

Note that path can be a local path or an hf:// URI.

Author

eddyxu commented Jan 6, 2026

@lhoestq Take another look?

Member

lhoestq commented Jan 8, 2026

I took the liberty to make a few changes :)

Now I believe we should be good:

  • both local and streaming work fine
  • both dataset and single files work fine
  • all files are properly downloaded now that all data files and metadata files are included in config.data_files
  • sharding is supported:
    • dataset: one shard = one fragment
    • single files: one shard = one file
  • streaming dataset resuming works fine thanks to Key()
  • the two hacks are visible and with TODOs to remove them when possible
    1. remove the revision in HF URIs since only "main" is supported
    2. write proper _version/* files since lance doesn't work if they are symlinks
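To make the sharding and resuming points above concrete, here is a toy sketch in plain Python. It is not the `datasets` implementation: `iter_examples`, `fragments`, and the tuple keys are hypothetical stand-ins for the fragment/batch iteration and for `Key(frag_idx, batch_idx)`. It assumes fragments and batches are always visited in the same order, so a stream can resume by skipping every key up to and including the last one it processed.

```python
from typing import Iterator, Optional

# Hypothetical stand-in for a Lance dataset: a list of fragments (shards),
# each fragment being a list of batches, each batch a list of rows.
Fragments = list[list[list[int]]]


def iter_examples(
    fragments: Fragments,
    resume_after: Optional[tuple[int, int]] = None,
) -> Iterator[tuple[tuple[int, int], list[int]]]:
    """Yield ((frag_idx, batch_idx), batch), skipping keys <= resume_after."""
    for frag_idx, fragment in enumerate(fragments):
        for batch_idx, batch in enumerate(fragment):
            key = (frag_idx, batch_idx)
            # Tuples compare lexicographically, which matches iteration order.
            if resume_after is not None and key <= resume_after:
                continue
            yield key, batch


fragments = [[[1, 2], [3, 4]], [[5, 6]], [[7], [8, 9]]]

# First pass: consume two batches, then stop (simulating an interruption).
stream = iter_examples(fragments)
seen = [next(stream), next(stream)]
last_key = seen[-1][0]

# Second pass: resume right after the last key that was processed.
resumed = list(iter_examples(fragments, resume_after=last_key))
```

In this sketch the first key component plays the role of the shard index, mirroring the "one shard = one fragment" (or one file) mapping listed above.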

I think this PR is ready, just let me know what you think before we merge 🚀

The next steps are:

Feel free to start some drafts (I noticed there are great examples in your HF account now!), I'll be happy to review :)

Once Lance is available in huggingface.js and the docs are ready, we can enable the Dataset Viewer and Lance code snippets on HF!

            yield Key(frag_idx, batch_idx), self._cast_table(table)
    else:
        for file_idx, lance_file in enumerate(lance_files):
            for batch_idx, batch in enumerate(lance_file.read_all(batch_size=self.config.batch_size).to_batches()):
Author

should we support columns pushdown here?

Member

Just added it at LanceFileReader() initialization, since the argument is not available in read_all()

Author

@eddyxu eddyxu Jan 8, 2026

Actually, how does it work with multiple data files within the same fragment?

In Lance, one fragment can consist of one or more data files, where each data file covers a few columns. This is how we can add new features / columns cheaply without rewriting the dataset: by adding new data files to an existing fragment.

Maybe we can address it as a follow-up task.
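As a rough illustration of the fragment layout described above, here is a toy model in plain Python. It is not Lance's on-disk format or API: `DataFile`, `Fragment`, `add_columns`, and `read` are hypothetical names, and the only assumption carried over from the comment is that a fragment's columns can be spread over several data files.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy data file: stores a few columns for every row of its fragment."""
    columns: dict[str, list]


@dataclass
class Fragment:
    """Toy fragment: a group of rows whose columns are spread over data files."""
    data_files: list[DataFile] = field(default_factory=list)

    def add_columns(self, new_columns: dict[str, list]) -> None:
        # Adding a column only appends a new data file; existing files are untouched.
        self.data_files.append(DataFile(new_columns))

    def read(self, columns: list[str]) -> dict[str, list]:
        # Reading a fragment means stitching the requested columns back
        # together from whichever data file holds each of them.
        out = {}
        for data_file in self.data_files:
            for name, values in data_file.columns.items():
                if name in columns:
                    out[name] = values
        return out


frag = Fragment([DataFile({"id": [1, 2, 3], "text": ["a", "b", "c"]})])
original_file = frag.data_files[0]

# Add an "embedding" column later without rewriting the original file.
frag.add_columns({"embedding": [[0.1], [0.2], [0.3]]})

rows = frag.read(["id", "embedding"])
```

In this toy model, reading a single data file on its own would return only some of the fragment's columns, which is the situation the question above is about.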

Member

In that case it's a dataset, no? It requires a manifest or something to tell what the fragments are made of.

LanceFileReader() is only used for single files, i.e. files that don't belong to a Lance dataset directory and don't require manifest files.

Author

I see. Let's 🚢 !!
